Voice quality normalization in an utterance for robust ASR
Abstract
In this paper, we propose a novel method of normalizing the voice quality in an utterance, for both clean speech and speech contaminated by noise. The normalization is applied to the N-best hypotheses from an HMM-based classifier; an SM (Sub-space Method)-based verifier then tests the hypotheses after normalizing the monophone scores together with the HMM-based likelihood score. The HMM-SM-based speech recognition system was proposed previously [1, 2] and successfully applied to a speaker-independent word recognition task and an OOV word rejection task. We extend that system to a connected digit string recognition task, explore the effect of voice quality normalization in an utterance for robust ASR, and compare it with HMM-based recognition systems that use utterance-level, word-level, monophone-level, and state-level normalization. Experiments on connected 4-digit strings showed that word accuracy improved significantly, from 95.7% for the typical HMM-based system with utterance-level normalization to 98.2% for the HMM-SM-based system on clean speech, from 88.1% to 91.5% on noise-added speech with SNR=10dB, and from 72.4% to 76.4% on noise-added speech with SNR=5dB. The other HMM-based systems also performed worse than the HMM-SM-based system.
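To put the reported gains on a common scale, the accuracies above can be converted into relative word error reductions. The sketch below is only an illustration: it assumes that word error rate equals 100% minus the reported word accuracy, which the abstract does not state explicitly, and the names `RESULTS`, `relative_error_reduction`, and `summarize` are hypothetical helpers rather than anything from the paper.

```python
# Word accuracies (%) quoted in the abstract, as (baseline HMM, HMM-SM) pairs.
RESULTS = {
    "clean": (95.7, 98.2),
    "SNR=10dB": (88.1, 91.5),
    "SNR=5dB": (72.4, 76.4),
}


def relative_error_reduction(baseline_acc, improved_acc):
    """Relative reduction in word error rate, in percent.

    Assumption: error rate = 100 - accuracy (not stated in the abstract).
    """
    baseline_err = 100.0 - baseline_acc
    improved_err = 100.0 - improved_acc
    return 100.0 * (baseline_err - improved_err) / baseline_err


def summarize(results):
    """Map each condition to its relative error reduction, rounded to 0.1."""
    return {
        cond: round(relative_error_reduction(base, new), 1)
        for cond, (base, new) in results.items()
    }


if __name__ == "__main__":
    for cond, rer in summarize(RESULTS).items():
        print(f"{cond}: {rer}% relative error reduction")
```

Under that assumption, the relative reduction is about 58% for clean speech, about 29% at SNR=10dB, and about 14% at SNR=5dB, so the benefit shrinks as the noise level rises even though the absolute accuracy gain stays between 2.5 and 4 points.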
Related resources
Acoustic Assessment of Disordered Voice with Continuous Speech Based on Utterance-Level ASR Posterior Features
Most previous studies on acoustic assessment of disordered voice focused on extracting perturbation features from isolated vowels produced with steady-state phonation. Natural speech, however, is considered preferable in terms of flexibility, effectiveness and reliability for clinical practice. This paper presents an investigation on applying automatic speech recognition (...
Auditory Filterbank Improves Voice Morphing
This paper presents a new method for vocal tract length (VTL) estimation and normalization based on a gammachirp auditory filterbank (GCFB) to improve the sound quality in voice morphing. VTL ratios between 28 speakers were estimated based on the spectral distances for all permutations (756 = 28 × 27 ordered speaker pairs). The VTL estimation using the mel-frequency filterbank (MFFB), which is a preprocessor for calc...
Evaluation of voice activity detection by combining multiple features with weight adaptation
For noise-robust automatic speech recognition (ASR), we propose a novel voice activity detection (VAD) method based on a combination of multiple features. The scheme uses a weighted combination of four conventional VAD features: amplitude level, zero crossing rate, spectral information, and Gaussian mixture model (GMM) likelihood. The weights for combination are adaptively updated using minimum...
Beginning of utterance detection algorithm for low complexity ASR engines
In this paper, a novel method for beginning of utterance detection is proposed for low complexity ASR systems. Assuming MFCC calculations in the ASR front-end, the additional computational load due to the algorithm is negligible. The algorithm makes use of the delay between the MFCC calculation and decoding process, which is typical in front-ends with feature normalization. The main steps of th...
Restoring Incorrectly Segmented Keywords and Turn-Taking Caused by Short Pauses
Appropriate turn-taking is an important issue in spoken dialogue systems. Especially in ones that feature quick responses, a user utterance is often incorrectly segmented by voice activity detection (VAD) because of short pauses within it. Incorrectly segmented utterances cause problems both in the automatic speech recognition (ASR) results and turn-taking: i.e., an incorrect VAD result leads t...